Enter the examples in code chunks and run them
Read R4ds Chapter 10: Tibbles, sections 1-3.
Load tidyverse package
library(tidyverse)
Enter code chunks and describe what each chunk does
Coerce the iris data frame to a tibble
as_tibble(iris)
This prints the iris data as a tibble.
Create a new tibble from individual vectors
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
Use non-syntactic names in tibble
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
Backticks let a tibble use column names that are not valid R variable names, such as names with unusual characters, spaces, or leading digits.
Make a transposed tibble with tribble()
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
tribble() lays out a small data set row by row, so the code is easy to read.
Enter and describe what each code chunk does
This code chunk shows the tibble print method, which displays only the first 10 rows and the columns that fit on screen.
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
The chunk above shows that printing a large tibble does not flood the console.
In the next chunk, print() with n and width controls how many rows and columns are displayed.
nycflights13::flights %>%
print(n = 10, width = Inf)
Use RStudio's built-in data viewer to see the complete dataset
nycflights13::flights %>%
View()
Create a small tibble and assign it to df
df <- tibble(
x = runif(5),
y = rnorm(5)
)
Use $ to extract a variable from df by name
df$x
[1] 0.23894038 0.31456439
[3] 0.55283884 0.31791088
[5] 0.02080429
Repeat the step above, but extract by name with [[ ]]
df[["x"]]
[1] 0.23894038 0.31456439
[3] 0.55283884 0.31791088
[5] 0.02080429
Repeat the step above, but extract by position with [[ ]]
df[[1]]
[1] 0.23894038 0.31456439
[3] 0.55283884 0.31791088
[5] 0.02080429
1. How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).
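A tibble prints a header line beginning "# A tibble:", shows only the first 10 rows and the columns that fit on screen, and lists each column's type under its name. Printing mtcars, a regular data frame, shows every row with no header and no column types.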
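Printing is not the only check. A minimal sketch, assuming the tidyverse is installed, that tests the class directly with is_tibble() and class():

```r
library(tidyverse)

is_tibble(mtcars)             # FALSE: mtcars is a regular data frame
is_tibble(as_tibble(mtcars))  # TRUE
class(as_tibble(mtcars))      # "tbl_df" "tbl" "data.frame"
```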
2. Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviors cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df$x
[1] "a"
df[, "xyz"]
[1] "a"
df[, c("abc", "xyz")]
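With a data frame, df$x partially matches the name and silently returns the xyz column, which can hide typos. df[, "xyz"] returns a vector while df[, c("abc", "xyz")] returns a data frame, so the type of the result depends on how many columns are selected. A tibble never does partial matching, and [ always returns another tibble.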
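To see the difference side by side, here is a small sketch that runs the same operations on the data frame and on a tibble built from it. The name tbl is made up for this example.

```r
library(tidyverse)

df <- data.frame(abc = 1, xyz = "a")
tbl <- as_tibble(df)   # tbl is an illustrative name

df$x            # partial matching: returns the xyz column
tbl$x           # NULL, with a warning about an unknown column

df[, "xyz"]     # a plain vector
tbl[, "xyz"]    # still a tibble with one column
```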
Read R4ds Chapter 11: Data Import, sections 1, 2, and 5.
Load tidyverse if you have not already done so before chapter 11
Do not run the first code chunk of this section
Use read_csv() on an inline CSV; it prints a column specification that gives the name and type of each column
read_csv("a,b,c
1,2,3
4,5,6")
This code chunk uses skip = n to skip the first n lines (here, two lines of metadata)
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
This code chunk uses comment = "#" to drop every line that starts with #
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
This code chunk uses col_names = FALSE when the data has no column names, so the first row is read as data and the columns are labelled X1, X2, X3
read_csv("1,2,3\n4,5,6", col_names = FALSE)
This code chunk passes a character vector to col_names to supply the column names
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
This code chunk uses na to name the value that stands for missing data in the file
read_csv("a,b,c\n1,2,.", na = ".")
1. What function would you use to read a file where fields were separated with “|”?
read_delim() with delim = "|"
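A quick sketch of that answer on inline data. The sample text and the name pipe_data are invented for illustration, and I() marks the string as literal data rather than a file path.

```r
library(tidyverse)

# Two columns separated by "|"; the values are made up for this example.
pipe_data <- read_delim(I("a|b\n1|2\n3|4"), delim = "|")
pipe_data
```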
2. (This question is modified from the text.) Finish the two lines of read_delim code so that the first one would read a comma-separated file and the second would read a tab-separated file.
file <- read_delim("file.csv", ",")
file <- read_delim("file.csv", "\t")
3. What are the most important arguments to read_fwf()?
file gives the path to the file to read
col_positions says where each column starts and ends; it is built with fwf_widths() or fwf_positions()
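A small sketch of those two arguments working together, assuming a made-up two-column layout where the name takes 6 characters and the age takes 2. The data and the name people are hypothetical.

```r
library(tidyverse)

# Fixed-width text: characters 1-6 hold the name, 7-8 hold the age.
fwf_text <- I("John  25\nMaria 31\n")

people <- read_fwf(
  fwf_text,
  col_positions = fwf_widths(c(6, 2), c("name", "age"))
)
people
```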
4. Skip
5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
2 parsing failures.
row col expected actual file
1 -- 2 columns 3 columns literal data
2 -- 2 columns 3 columns literal data
read_csv("a,b,c\n1,2\n1,2,3,4")
2 parsing failures.
row col expected actual file
1 -- 3 columns 2 columns literal data
2 -- 3 columns 4 columns literal data
read_csv("a,b\n\"1")
2 parsing failures.
row col expected actual file
1 a closing quote at end of file literal data
1 -- 2 columns 1 columns literal data
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
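Line 1: only two column names are given, but each data row has three values, so the third value in each row is dropped.
Line 2: three column names are given; the first row has only two values, so c is NA, and the second row has four values, so the extra value is dropped.
Line 3: the opening quote is never closed, and the row has only one value, so b is NA.
Line 4: there is no parsing failure, but each column mixes a number and a letter, so both columns are read as character.
Line 5: the values are separated by semicolons, so read_csv() reads everything as a single column; use read_csv2() instead.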
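For the semicolon case, a minimal sketch of the read_csv2() fix on the same inline data. The name semi is made up, and I() marks the string as literal data.

```r
library(tidyverse)

# read_csv2() expects ";" as the field separator.
semi <- read_csv2(I("a;b\n1;3"))
semi
```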
Just read this section. You may find it helpful in the future to save a data file to your hard drive. It is basically the same format as reading a file, except that you must specify the data object to save, in addition to the path and file name.
Read R4ds Chapter 18: Pipes, sections 1-3.
Nothing to do otherwise for this chapter. Is this easy or what?
Note: Try using pipes for all of the remaining examples. That will help you understand them.
Read R4ds Chapter 12: Tidy Data, sections 1-3, 7.
Reload tidyverse if you closed R before starting this chapter
Study Figure 12.1 and relate the diagram to the three rules listed just above them. Relate that back to the example I gave you in the notes. Bear this in mind as you make data tidy in the second part of this assignment.
You do not have to run any of the examples in this section.
Run through the examples
2. Why does this code fail? Fix it
table4a %>%
pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
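The code fails because 1999 and 2000 are non-syntactic names. Without backticks, pivot_longer() treats them as column positions, and table4a has only 3 columns, so the positions are out of range. The fix is to wrap the names in backticks: c(`1999`, `2000`).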
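The corrected call, as a runnable sketch. The result name tidy4a is made up for this example.

```r
library(tidyverse)

# Backticks mark 1999 and 2000 as column names, not column positions.
tidy4a <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4a
```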
Read R4ds Chapter 5: Data Transformation, sections 1-4.
Time to get small.
Load necessary libraries.
Study Figure 5.1 carefully. Once you learn the &, |, and ! logic, you will find them to be very powerful tools.
library(nycflights13)
flights
1 Find all flights that:
1.1 Had an arrival delay of two or more hours
filter(flights, arr_delay >= 120)
1.2 Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
1.3 Were operated by United, American, or Delta
filter(flights, carrier == "UA"| carrier == "AA"| carrier == "DL")
1.4 Departed in summer (July, August, and September)
filter(flights, month == 7 | month == 8 | month == 9)
1.5 Arrived more than two hours late, but didn’t leave late
filter(flights, dep_delay <= 0 & arr_delay > 120)
1.6 Were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
1.7 Departed between midnight and 6am (inclusive)
# midnight is recorded as 2400 in this data, not 0
filter(flights, dep_time <= 600 | dep_time == 2400)
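The chains of == joined with | can also be written with %in%, which checks each value against a set. A short sketch, assuming nycflights13 is installed; the names big_three and summer are made up for this example.

```r
library(tidyverse)
library(nycflights13)

# Same rows as carrier == "UA" | carrier == "AA" | carrier == "DL"
big_three <- filter(flights, carrier %in% c("UA", "AA", "DL"))

# Same rows as month == 7 | month == 8 | month == 9
summer <- filter(flights, month %in% 7:9)
```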
2 Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
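between(x, left, right) is a shortcut for x >= left & x <= right, so it can replace a pair of comparisons such as the range test on dep_time, as shown below.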
filter(flights, between(dep_time, 0, 600))
3 How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
sum(is.na(flights$dep_time))
[1] 8255
filter(flights, is.na(dep_time))
dep_delay, arr_time, arr_delay, and air_time are also missing in these rows. Flights with no departure or arrival information are most likely cancelled flights.
4 Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
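NA ^ 0 is 1, because any number raised to the power 0 is 1. NA | TRUE is TRUE, because anything OR TRUE is TRUE. FALSE & NA is FALSE, because anything AND FALSE is FALSE. The general rule: the result is not missing when it would be the same no matter what value the NA stands for. NA * 0 is still NA because the rule fails for it: Inf * 0 is NaN, not 0.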
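These cases can be checked directly in the console. A minimal sketch using base R only:

```r
NA^0        # 1
NA | TRUE   # TRUE
FALSE & NA  # FALSE
NA * 0      # NA
Inf * 0     # NaN, which is why NA * 0 cannot safely be 0
```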
1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
arrange(flights, desc(is.na(dep_delay)))
2. Sort flights to find the most delayed flights. Find the flights that left earliest.
Flights with the greatest delay
arrange(flights, desc(dep_delay))
Flights that left earliest relative to their scheduled time
arrange(flights, dep_delay)
3. Sort flights to find the fastest (highest speed) flights.
arrange(flights, desc(distance / air_time))
4. Which flights traveled the farthest? Which traveled the shortest?
arrange(flights, desc(distance))
arrange(flights, distance)
1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights. Find at least three ways.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with('dep'), starts_with('arr'))
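A few more ways to reach the same four columns, as a sketch. The positions 4, 6, 7, and 9 assume the column order of flights in nycflights13, and the result names are made up for this example.

```r
library(tidyverse)
library(nycflights13)

# By column position
by_position <- select(flights, 4, 6, 7, 9)

# By regular expression on the column names
by_regex <- select(flights, matches("^(dep|arr)_(time|delay)$"))

# By a character vector of names
by_vector <- select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
```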
2: What happens if you include the name of a variable multiple times in a select() call?
select() ignores the repeats: the variable is included only once, at the position where it first appears.
3: What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
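one_of() selects the columns whose names appear in a character vector. That is helpful here because the names are already stored in vars, so they do not have to be typed again inside select().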
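A short sketch of one_of() with the vars vector from the question. The result name picked is made up; newer tidyselect versions also offer any_of() and all_of() for the same job.

```r
library(tidyverse)
library(nycflights13)

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# Select the columns whose names are stored in vars
picked <- select(flights, one_of(vars))
names(picked)
```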
4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
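The call returns every column whose name contains "time", even though the search string is in upper case. That can be surprising, but the select helpers ignore case by default. To change the default, add ignore.case = FALSE.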
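A quick sketch comparing the default with case-sensitive matching, assuming nycflights13 is installed. The result names are made up for this example.

```r
library(tidyverse)
library(nycflights13)

# Default: case is ignored, so "TIME" matches dep_time, air_time, and so on
any_case <- select(flights, contains("TIME"))
names(any_case)

# Case sensitive: no column name contains upper-case "TIME"
exact_case <- select(flights, contains("TIME", ignore.case = FALSE))
ncol(exact_case)
```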